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Abstract 

Background: Several large-scale gene co-expression networks have been constructed successfully for predicting 
gene functional rmodules and cis-regulatory elerments in Arabidopsis {Arabidopsis tlioliono). However, these 
networks are usually constructed and analyzed in an ad hoc manner. In this study, we propose a completely 
parameter-free and systematic method for constructing gene co-expression networks and predicting functional 
modules as well as cis-regulatory elements. 

Results: Our novel method consists of an automated network construction algorithm, a parameter-free procedure 
to predict functional modules, and a strategy for finding known cis-regulatory elements that is suitable for 
consensus scanning without prior knowledge of the allowed extent of degeneracy of the motif We apply the 
method to study a large collection of gene expression microarray data in Arabidopsis. We estimate that our co- 
expression network has -94% of accuracy, and has topological properties similar to other biological networks, such 
as being scale-free and having a high clustering coefficient. Remarkably, among the -300 predicted modules 
whose sizes are at least 20, 88% have at least one significantly enriched functions, including a few extremely 
significant ones (ribosome, p < lE-300, photosynthetic membrane, p < 1.3E-137, proteasome complex, p < 5.9E- 
126). In addition, we are able to predict cis-regulatory elements for 66.7% of the modules, and the association 
between the enriched cis-regulatory elements and the enriched functional terms can often be confirmed by the 
literature. Overall, our results are much more significant than those reported by several previous studies on similar 
data sets. Finally, we utilize the co-expression network to dissect the promoters of 19 Arabidopsis genes involved 
in the metabolism and signaling of the important plant hormone gibberellin, and achieved promising results that 
reveal interesting insight into the biosynthesis and signaling of gibberellin. 

Conclusions: The results show that our method is highly effective in finding functional modules from real 
microarray data. Our application on Arabidopsis leads to the discovery of the largest number of annotated 
Arabidopsis functional modules in the literature. Given the high statistical significance of functional enrichment and 
the agreement between cis-regulatory and functional annotations, we believe our Arabidopsis gene modules can 
be used to predict the functions of unknown genes in Arabidopsis, and to understand the regulatory mechanisms 
of many genes. 
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Background 

Transcriptome analysis is the key of functional genomics 
research, and is often the hub of integrative analysis of 
-omics data. High-throughput expression profiling tech- 
niques such as DNA microarray [1] and RNA-seq [2] 
have resulted in thousands of gene expression data sets, 
each containing dozens to hundreds of experiments, 
being deposited into public databases such as the NCBI 
Gene Expression Omnibus (GEO) [3]. To effectively 
exploit this wealth of data, however, there is an urgent 
call for systematic methods to integrate data across mul- 
tiple experiments. 

Recently, there has been a surging interest in producing 
gene co-expression networks from microarray data, which 
have been shown as an important and useful technique in 
discovering knowledge from gene expression microarray 
data, with many interesting results being reported [4-18]. 
In a co-expression network, the nodes are genes and the 
edges indicate similar expression patterns between genes, 
according to some similarity metric. Co-expression is 
often correlated with functional relationships, such as phy- 
sical interaction between the encoded proteins, or logical 
interaction in related biochemical and signaling pathways 
[19-22]. It has also been shown that gene co-expression 
networks have scale-free topology and community struc- 
ture, similar to other biological networks [4,22]. Further- 
more, it appears that genes with more co-expression links 
tend to be more evolutionarily conserved and essential 
[22]. Co-expression networks can be constructed in a 
number of ways, most of which involve some ad hoc para- 
meters. The majority of the existing methods for con- 
structing co-expression networks are based on some 
similarity threshold: two genes are connected by an edge 
whenever the similarity (or some transformation of it) 
between their expression levels is above a certain value 
[4,5,7-11,14-18,23]. This threshold is usually dataset 
dependent, although a few ideas have been proposed to 
help in the automatic selection of the threshold [5,9]. 
A problem with such threshold-based approachs is that 
different biological processes may show different levels of 
co-expression. Therefore, it is unlikely that a single thresh- 
old can be used to define all co-expression links. Recently, 
we and others proposed an asymmetric /:-nearest-neighbor 
(aKNN) -based approach to construct gene co-expression 
networks [12,13,24,25]. Basically, for each gene^, we con- 
nect it to k other genes whose similarity to g is ranked the 
top k among all the genes. The advantage of this approach 
is that two genes sharing only weak expression similarity 
may be linked. We showed that a small k is needed to 
keep the whole network connected, and partitioning the 
network can result in higher module prediction accuracy 
than conventional clustering algorithms [24]. A problem 
with this approach, however, is that the microarray data 
needs to be preprocessed so that genes unrelated to the 



process of interest are removed before the construction of 
the network, to prevent them from being accidentally 
included in the network. 

In this study, we propose a mutual k-nearest neighbor 
(mKNN) approach, which solves the problem of unspecific 
connections in the aKNN network, and is robust to 
random noise and scatter genes. We also propose a strat- 
egy to automatically determine the optimal k for con- 
structing gene co-expression networks based on network 
topologies. We then apply a parameter-free modular dis- 
covery algorithm that we have developed previously [26] 
to partition the network into relatively dense subnetworks 
as candidates of functional modules. We also propose a 
cis-regulatory element finding algorithm that is suitable 
for consensus scanning without prior knowledge of the 
allowed extent of degeneracy of the motif. We applied the 
method to construct and analyze a whole-genome gene 
co-expression network for Arabidopsis {Ambidopsis thali- 
ana) using more than one thousand microarray experi- 
ments. Prom the network we identified many interesting 
modules that are functionally coherent and potentially co- 
regulated. Remarkably, the functional modules we pre- 
dicted are statistically much more significant than those 
reported by previous studies on similar data sets. In addi- 
tion, we have predicted cis-regulatory elements for many 
of the functional modules, and the relationship between 
the cis-regulatory elements and the functional modules 
can often be confirmed by published results. Our results 
lead to the discovery of the largest number of Arabidopsis 
functional modules in the literature. Given the high statis- 
tical significance of Gene Ontology enrichment and the 
agreement between cis-regulatory and functional annota- 
tions of these genes modules, we believe that the results 
can be utilized to predict the functions of unknown genes 
in Arabidopsis, and to understand the regulatory mechan- 
isms of many genes. As a proof of concept, we used the 
co-expression network to dissect the promoters of gibber- 
ellin metabolism and signaling genes, with some promising 
results that reveal new insight into the biosynthesis and 
signaling of the important plant hormone gibberellin. 

Results and discussion 

We used a large collection of Arabidopsis gene expression 
microarray data that include 1388 microarrays for various 
growth conditions, developmental stages, and tissues of 
Arabidopsis [27,28]. The high quality of this collection of 
microarray data and diverse experimental conditions allow 
us to construct a global gene co-expression network that 
captures true functional relationship between pairs of 
genes. 

Overall network properties 

Applying the automated network construction method, 
we obtained a gene co-expression network with k = 100, 
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determined automatically by two topological measures 
(see Methods). The network contained 707602 edges, 
and a giant connected component (GCC) that included 
21373 (-95%) of the 22591 genes assayed by the micro- 
array. The next largest connected component had only 
3 genes, while 1150 genes had no connections at all, 
which we omitted from further analysis. The mean and 
median vertex degree of the GCC is 33 and 26, 
respectively. 

To compare, we randomized the gene expression data 
set and applied the same network construction method 
(with k = 100) to obtain a random network. It is impor- 
tant to note that this random network is not a random 
rewiring of the real network. In general, random rewiring 
would completely destroy the local modular structure of 
real networks, while a network constructed from rando- 
mized data would still have some modularity, because of 
the transitivity of most similarity measurement (i.e., a is 
close to b and b is close to c imply that a is close to c). 
We found that the real network and the random network 
had very different statistics. The largest connected com- 
ponent in the random network only contained 3183 
genes, and 18187 genes has zero connections. Overall, 
this random network only contained 42354 edges; there- 
fore, we estimated that our co-expression network had 
-94% of accuracy. 

As shown in Figure 1(a), the real network seems to 
follow a power-law degree distribution, with an expo- 
nential truncation, which is common for many real- 
world networks. Note that the frequency of nodes with 
degree of 100 is artificially inflated, because the maxi- 
mum degree allowed by the method is 100. The random 
network also appears to have a power-law degree distri- 
bution, but the network contains much fewer nodes of 
maximum degree. This shows that the requirement for 
two vertices to confirm their ranks eliminated most of 
the noise connections. 

Furthermore, the real network has a much higher clus- 
tering coefficient than the random network (0.4001 vs. 
0.1384). To show that the difference between the cluster- 
ing coefficients of the two networks is not due to their 
sizes, we randomly sampled a subnetwork of size 3183 
from the real network, and determined that its clustering 
coefficient is 0.399 ± 0.0199. Therefore, the real network 
has strong modularity that cannot be explained by the 
transitivity of the similarity measure. Also, after randomly 
rewiring, the clustering coefficients of both networks 
approaches zero. 

Finally, it is worth mentioning that both the real net- 
work and the random network have similarly high degree 
correlations (0.6132 and 0.6084, respectively). This is dif- 
ferent from asymmetric nearest neighbor networks [24], 
which typically has a negative degree correlation (data not 



shown). This may be an important property to consider 
when designing algorithms for analyzing these two types 
of networks. 

Enriched gene ontology terms 

Using HQcut, we find 1473 modules from the largest 
component of the network, with sizes from 2 to 175. 
Many of them have small sizes. Module sizes follow a 
power-law distribution for the region between 10 and 
100 (Figure 1(b)). Overall, there are -800 modules of size 
at least 10, and -300 modules of size at least 20. Gene 
Ontology analysis revealed that many of the modules 
have significantly enriched functions. Among the -800 
(300) modules whose sizes are at least 10 (20), 81.1% 
(88.0%) of them have at least one enriched function, with 
a Bonferroni corrected /7-value < 0.05. 

Table 1 shows 15 modules with the most significant 
Gene Ontology enrichment (top half), and 11 selected 
modules with significantly enriched cis-regulatory ele- 
ments (bottom half), which will be discussed in the next 
subsection. The functional enrichment is extremely sig- 
nificant for some modules. For example, we have found 
several modules where the majority of genes are involved 
in the same specific functions (cl402, ribosome, p < lE- 
300; cl473, photosynthetic membrane, p < 1.3E-137; 
cl051, proteasome complex, < 5.9E-126). These mod- 
ules also had statistically over-represented cis-regulatory 
elements, and the association between the functional 
modules and the cis-regulatory elements for many mod- 
ules can be confirmed using previously published obser- 
vations (see next subsection). 

Figure 2 shows a subnetwork that contains the top 40 
modules with the highest statistical significance of 
enrichment of Gene Ontology terms. It is evident that 
most modules correspond to densely connected subnet- 
works. However, some of the modules appear to have 
close to linear structures, for example, cll09 (heat shock 
protein binding, p < 7E-19) and c587 (CCAAT-binding 
factor complex, p < lE-22). This indicates that our 
method is able to identify not only densely connected 
functional modules, but also those sparsely connected, 
pathway-like structures. In addition, modules that are 
enriched with similar functions are often highly con- 
nected. For instance, c973, c991, and c992 are all 
involved in cell cycle, and cl473, cl474, cl434 and cl463 
are involved in photosynthesis. These highly intercon- 
nected functional modules confirm a hierarchical organi- 
zation of cellular functions [29]. 

Enriched cis-regulatory elements 

We annotated each module with a list of known cis-regu- 
latory elements (motifs) from the PLACE database that 
are over-represented in the promoter sequences of the 
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Figure 1 Network properties, (a) Degree distribution of tine co-expression networl< constructed from real data or randomized data, (b) Module 
size distribution. 



genes in the module (see Methods). Overall, 66.7% of the 
modules with size > 20 have at least one over-represented 
cis-regulatory element with a nominal p-vdlue < 0.001. 

Table 1 shows the most over-represented cis-regulatory 
element for each module. Note that statistical signifi- 
cance of the over-representation of cis-regulatory ele- 
ments is typically much weaker than that of GO terms; 
this is because cis-regulatory elements are short and 



degenerate, and as a result may appear in promoter 
sequences simply by chance. Nevertheless, based on the 
information from the PLACE database [30] , we find that 
many of the associations between the functional modules 
and the enriched cis-regulatory elements can be 
explained. For example, c453 is enriched with heat 
response genes, while the most significant motif in the 
module is a heat shock element [31]. Module c992 
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Table 1 Most significant modules according to function or motif 



ID 


Size 


Enriched Function 


p- value 


Enriched Motif 


p- value 


C1402 


174 


structural constituent of ribosome 


<lE-300 


UPIATMSD 


<1E-16 


C1473 


1 10 


photosynthetic membrane 


1.30E-137 


ACGTR00T1.1 


4.00E-15 


C1051 


70 


proteasome complex 


5.90E-126 


SITEIIATC^C 


6.00E-06 


c1474 


154 


plastid 


3.70E-93 


UPIATMSD 


1 .40E-08 


c453 


91 


response to heat 


1 .20E-54 


HSE 


3.80E-1 1 


c992 


47 


nucleosome assembly 


1 .20E-50 


OCEP/PEINTHISTONE 


2.30E-14 


c619 


47 


mitochondrion 


5.40E-50 






C1434 


53 


plastid 


5.10E-41 


UPIATMSD 


5.60E-05 


C1463 


56 


chloroplast thylakoid 


5.60E-38 






c620 


65 


mitochondrion 


1 .OOE-32 


SITEIIATCYTC 


1 .60E-07 


c1090 


30 


RNA splicing 


1.70E-31 






c991 


45 


DNA metabolic process 


2.20E-31 


E2FAT 


l.lOE-06 


c148 


112 


nutrient reservoir activity 


6.60E-31 


RYREPEATBNNAPA 


4.50E-12 


c1257 


55 


endoplasmic reticulum 


7.30E-30 


UPRMOTIFIIAT 


3.70E-1 1 


c973 


134 


microtubule motor activity 


8.20E-30 


MYBCOREATCYCBl 


6.10E-1 1 


c701 


17 


aromatic compound metabolic process 


8.00E-24 


LlDCPALl 


5.90E-08 


c294 


26 


response to water 


2.60E-22 


DRECRTCOREAT 


5.10E-06 


c778 


99 


circadian rhythm 


2.00E-17 


EVENING AT 


2.20E-16 


c711 


18 


response to auxin stimulus 


6.70E-15 


MYCATRD22 


4.90E-05 


c488 


72 


defense response 


7.10E-15 


CGCGBOXAT 


2.30E-10 


C1369 


59 


ribonucleoprotein complex biogenesis and assembly 


8.00E-15 


UP2ATMSD 


<1E-16 


c489 


81 


response to abiotic stimulus 


9.30E-09 


CGCGBOXAT 


<1E-16 


c493 


25 


glutathione transferase activity 


7.00E-07 


0CSELEMENTAT.4 


7.80E-16 


c316 


9 


abscisic acid mediated signaling 


1.90E-06 


ABREATRD22 


l.OOE-06 


c140 


36 


embryonic development ending in seed dormancy 


1.60E-05 


ABRERATCAL 


4.00E-10 


c302 


14 


response to abscisic acid stimulus 


1.90E-04 


ABRE3HVA1 


9.20E-07 



contains nucleosome assembly genes and is enriched 
with OCETYPEINTHISTONE, a composite motif known 
to be involved in regulating S phase-specific expression 
of a histone gene [32]. Module c991 has function in DNA 
replication, and is enriched with binding sites for the E2F 
family of transcription factors, which play a major role in 
regulating cell cycles [33]. Module cl257 contains genes 
associated with the endoplasmic reticulum (ER); the 
most significant motif in the module is UPRMOTIFIIAT, 
a cis-acting element regulating the unfolded protein 
response, which is activated in response to an accumula- 
tion of unfolded or misfolded proteins in the ER [34,35]. 
Another cell-cycle related module, c973, is enriched with 
MYBCOREATCYCBl, a core cis-regulatory element for 
the Arabidopsis cyclin Bl:l gene [36]. Module c701 is 
involved in aromatic compound metabolic process and is 
enriched with LlDCPALl, a cis-regulatory element initi- 
ally identified in a phenylalanine ammonia-lyase gene of 
Daucus carota (carrot) [37]. Module c294 contains water 
responsive genes and is enriched with a dehydration- 
responsive element, DRE/CRT [38,39]. For module c778, 
which is enriched with genes regulating circadian 
rhythm, the most significant motif in the module is EVE- 
NINGAT that is important for conferring rhythmicity to 



gene expression [40]. The OCSELEMENTAT motif 
enriched in c302 was found in the Arabidopsis glu- 
tathione S-transferase gene [41,42]. Finally, a few mod- 
ules contain genes having functions in abiotic stress 
responses or embryonic development (c711, c488, c489, 
c316, c302, cl40), while the corresponding cis-regulatory 
elements are either the well-known abscisic acid (ABA) 
responsive elements (ABREs) [43] or the ubiquitous 
CGCG-box, which is known to be involved in multiple 
signaling pathways in plants [44]. Interestingly, ABRE, 
UP1/2ATMSD, SITEIIATCYTC, and several other motifs 
have occurred in multiple modules, indicating that they 
may be involved in regulating multiple processes. 

Overall, a total of 177 unique motifs were found to be 
statistically significant in 144 modules, with a false dis- 
covery rate (FDR) < 0.1. Figure 3 shows the final tran- 
scriptional regulatory network in Arabidopsis, where a 
circle represents a gene module, and a triangle represents 
a motif. The size of a node is proportional to its module 
size or number of modules it regulates. The motif clus- 
ters in the center of the left subnetwork contain many 
ABRE and gibberellin (GA)-related motifs, which is 
understandable as ABA and gibberellin regulate many 
biological processes in plants. Several other interesting 
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Figure 2 Gene co-expression subnetworic of Arabidopsis. Subnetwork contains genes in tine top 40 functional modules with the highest 
statistical significance of enrichment of Gene Ontology terms. 
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cis-regulatory elements such as auxin response element 
(AREs), low temperature response elements (LTREs), and 
drought response elements (DREs) are loosely connected 
to the center, indicating the cross-talk among these sti- 
mulus response processes. 

Comparison with previous studies 

Several previous studies have attempted to predict func- 
tional modules in Arabidopsis, using essentially the 
same microarray data compendium, based on co-expres- 
sion networks or clustering methods [9,10,45,46]. It is 
worth noting that the previous co-expression networks 
were all constructed by some variants of the threshold- 



based methods (see Background). Remarkably, the 
enrichment of GO terms in our functional modules is 
much more significant than in all previous studies, to 
the best of our knowledge. For example, Horan et al. 
applied hierarchical clustering directly to the microarray 
data and obtained 916 clusters [45]. The most signifi- 
cant GO terms in their clusters are photosynthesis {p < 
1.3E-89), ribosome {p < 5.3E-65), and proteasome com- 
plex {p < lE-28). Mao et al. constructed a co-expression 
network using a Pearson correlation coefficient cutoff 
0.75 [10]. Using the Markov clustering algorithm (MCL) 
[47], they identified 527 clusters. The five most signifi- 
cant clusters contain genes in photosynthesis {p < 1.4E- 



Ruan et al. BMC Bioinformotics 201 1, 12(Suppl 12):S2 
http://www.biomedcentral.eom/1 471 -21 05/1 2/S1 2/S2 



Page 7 of 1 3 




G-box, ABRE/GA- 
related motifs 




Figure 3 Arabidopsis cis-regulatory network. A circle represents a gene module. A triangle represents a motif. The size of a node is 
proportional to its module size or the number of modules it regulates. 



52), protein biosynthesis {p < 5.7E-52), DNA metabolism 
{p < 9.1E-52), starch metabolism {p < 3.2E-19), and 
response to heat {p < 1.7E-17). 

Ma et al [9] and Vandepoele et al [46] have also used 
co-expression networks for predicting functional mod- 
ules, but the overall goals/strategies of their studies are 
different from ours. Ma et al. attempted to find co- 
expressed neighbors of known guide genes. The five 
most significant GO terms found by Ma et al. are 
response to heat {p < 9.4E-55), chromatin {p < 7.5E-48), 
response to auxin {p < 3.6E-41), proteasome complex 
(p < 6.7E-29), and starch metabolism {p < 6.5E-18). The 
work of Vandepoele et al. combines co-expression with 
sequence-level conservation between Arabidopsis and 
poplar. The most significant GO terms they found are 
photosynthesis {p < 2.2E-87), ribosome biogenesis and 
assembly (6.1E-68), and DNA replication (p < 8.9E-26). 



Finally it is worth noting that our network (mean ver- 
tex degree = 26) is much sparser than the network of 
Mao et al. (mean vertex degree = 165), and that of Van- 
depoele et al. (mean vertex degree = 717). Our network is 
more sparse, making it easier for analysis and visualiza- 
tion. At the same time, our network covers about 95% of 
the Arabidopsis genes, whereas the networks by Ma et al. 
and Mao et al. only cover about 30% of Arabidopsis 
genes. As a result, we are able to identify more functional 
modules than in these previous studies. 

Application: gene-centric analysis 

As an application, we used the co-expression network 
to study a set of gibberellin (GA) metabolism and 
signaling genes. The GAs are a group of plant hor- 
mones that singly or in combination with other hor- 
mones regulate many aspects of plant growth and 
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development including germination, stem elongation 
and flowering [48]. We compiled 16 Arabidopsis genes 
that encode three small families of 2-oxoglutarate- 
dependent dioxygenase (2-ODD) enzymes in the GA 
metabolic pathway (4 GA 2-oxidases, 7 GA 3-oxidases 
and 5 GA 20-oxidases), and 3 genes that encode GA 
receptors (GIDla, GIDlb, and GIDlc). The amount of 
bioactive GA which binds to the receptor to transduce 
a biological response must be closely regulated by fine 
tuning GA biosynthesis and deactivation [48]. For each 
of the genes, we are interested in obtaining their cis- 
regulatory elements. We first obtained the co-expres- 
sion neighbors for each gene. For each gene, we then 
combined itself and its co-expressed neighbors into a 
single list. We then searched for motifs that are not 
only over-represented in the list, but appeared in the 
promoter of the gene of interest (i.e., 2-ODD or GIDl 



genes). Figure 4 shows the network with the GA meta- 
bolism and signaling genes and their putative cis-regu- 
latory elements. The width of an edge is proportional 
to the significance of enrichment. We predicted cis- 
regulatory elements for 11 out of 16 GA 20DD genes 
and 2 out of 3 GA receptor genes. Abscisic acid (ABA) 
response elements (ABREs) play important roles in 
regulating several of these genes. It is well known that 
the balance between GA and ABA is an important fac- 
tor regulating the development and growth of many 
plants [49]. It can also be seen that GA20ox and 
GA3ox families, both critical in the biosynthesis of 
bioactive GA, share more common cis-regulatory ele- 
ments with each other than with members of the 
GA2ox family, which are responsible for GA deactiva- 
tion [48]. The RYREPEATBNNAPA motif is also 
important for ABA responses [50]. 
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Figure 4 Cis-regulatory networl^ of Arabidopsis gibberellin metabolism and signaling genes. Yellow and green nodes represent genes 
and cis-regulatory elements, respectively. The width of an edge is proportional to the significance of enrichment, measured by the negative 
logarithm of the p-value. The number after the dot following the motif name represents the number of mismatches allowed in order to obtain 
maximum statistical significance. 
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Conclusions 

In this study, we have proposed a novel network-based 
method for identifying gene functional modules and 
cis-regulatory elements from a large microarray data 
set of Arabidopsis. The method included a mutual k- 
nearest-neighbor network construction method with 
automatic parameter selection, a modularity-based 
parameter-free module detection algorithm, and a cis- 
regulatory element finding algorithm that is suitable 
for consensus scanning without prior knowledge of the 
allowed extent of degeneracy of the motif. Since the 
method is completely parameter free, it is especially 
useful to be applied to analyzing microarray data sets 
that are of very large scale or are assaying poorly 
understood biological processes, where the appropriate 
network parameters and number of modules are diffi- 
cult to estimate. 

Applying our method to a large collection of Arabi- 
dopsis microarray data, we have significantly improved 
the prediction accuracy of functional modules com- 
pared to several previous studies. Our application leads 
to the discovery of the largest number of Arabidopsis 
functional modules in the literature; for many modules, 
we are able to annotate them with functional terms 
and cis-regulatory elements. Together, the high statisti- 
cal significance of Gene Ontology enrichment and the 
agreement between cis-regulatory and functional anno- 
tations of these genes modules in Arabidopsis show 
that our Arabidopsis gene modules are excellent candi- 
dates of functional modules. Therefore, we believe that 
the results can be utilized to predict the functions of 
unknown genes in Arabidopsis, and to understand the 
regulatory mechanisms of many genes. As a proof of 
concept, we have used the co-expression network to 
dissect the promoters of gibberellin metabolism and 
signaling genes, with some promising results that 
reveal new insight into the biosynthesis and signaling 
of the important plant hormone gibberellin. We are 
constructing a database and web interface for querying 
the Arabidopsis gene co-expression network, the pre- 
dicted functional modules and associated cis-regulatory 
elements. 

Methods 

Data 

Gene expression microarray data were downloaded from 
The Arabidopsis Information Resource (TAIR) [51] and 
normalized according to the procedure of ATTED-II 
[52]. Promoter Sequences, defined as lOOObp upstream 
to transcription starting sites, were downloaded from 
TAIR. Known cis-regulatory elements were downloaded 
from the PLACE database [30]. 



Network construction with topology-based parameter 
selection 

We define a network as G = {F, £"}, where Vis the set of 
entities and E is the set of edges. Alternatively, we repre- 
sent a network by its adjacency matrix, W = (Wij), where 
Wij = 1 if there is an edge between and Vp and 0 other- 
wise. Let Sij be the similarity between gene i and gene /, 
where similarity in this study is measured by Pearson 
correlation coefficient. With a given parameter /c, the 
number of nearest neighbors to consider, the mutual k- 
nearest neighbor (mKNN) -based network is constructed 
by connecting any two genes that are within the top-/c 
most similar genes of each other. That is, for gene / and 
gene ; to be connected, they both need to be on the other 
gene's top-Zc list. This is different from the previous 
aKNN-based method, where two genes are connected if 
one is on the top-/: list of the other. Formally, we let Wij = 
1 if Sij > max{5^^^, 5,yJ or 0 otherwise, where i/^ is the index 
of the gene whose similarity to gene / is smaller than 
exactly k - 1 other genes. In other words, \x, x ^ i and 
>Sii^ \=k-l. 

The advantage of the mKNN methods compared to the 
threshold-based or aKNN-based methods (see Back- 
ground) can be explained by a small example in Figure 5, 
which shows a similarity matrix containing three mod- 
ules of different sizes (10, 40, and 100, respectively) and 
three networks constructed by the above methods. We 
have chosen parameters to make the three networks to 
have approximately the same density. Assume that the 
diagonal blocks (within-cluster gene pairs) in the similar- 
ity matrix are generated from the same distribution, and 
that the similarity scores in the off-diagonal regions 
(inter-cluster gene pairs) are generated from a different 
distribution. In the threshold-based method, all entries in 
the diagonal blocks have the same probability to be 
selected as edges. As a result, the expected edge density 
in different clusters will be the same. This, however, cre- 
ates a huge disadvantage for the vertices in the smaller 
clusters, as they will have a smaller number of within- 
cluster edges compared to those in larger clusters. 

Even worse, they may by chance have more inter-cluster 
edges. In the two KNN-based networks, in contrast, the 
smaller clusters usually have higher within-cluster edge 
densities, because the networks were constructed by con- 
necting each vertex to the same number of neighboring 
vertices. (This does not mean all vertices have the same 
number of edges, however.) Indeed, as shown in Figure 5, 
the smallest cluster represented by the upper left diagonal 
block has much higher edge densities in the two KNN- 
based networks than in the threshold-based network. The 
mKNN-based network also has fewer inter-cluster edges 
in the off-diagonal regions than the aKNN-based network. 
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The parameter k is determined automatically based on 
the assumption that real networks have topological 
properties that are different from random networks [53]. 
For example, real networks often have a long-tail degree 
distribution, the small-world property, and high cluster- 
ing coefficient [53]. Therefore, it is often suggested that 
these properties may be used to distinguish real net- 
works from their random counterparts [5,53]. 

The topology-based parameter selection method works 
as follows. Given a co-expression network construction 



method and a topological measure F, we first decide a set 
of possible values for the parameter (i.e., k). We then 
construct a co-expression network using each parameter 
value, and compute the F value of the resulting network. 
At the same time, we also generate a random network by 
applying the same network construction method to a 
randomly permuted copy of the original expression data, 
and compute the corresponding F value of the random 
network. We then choose the network parameter that 
maximizes the difference between F^rue and Frandom- 
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Formally, let G(A, k) be the co-expression network gener- 
ated on data set A using parameter /c, and let A'' be the 
permuted data, the optimal network G* is constructed as 
follows: 

G* = G(A, k"), where fe* = arg max(rG(A,fe) - ^c{A\k)^' (1) 

Here we consider two types of topological measures. 
The first is the clustering coefficient, defined by the fol- 
lowing formula: ^ = ^ ^^i/^ii^i ~ 1) > where N is 

the number of vertices in the network, di is the degree 
of vertex /, and rii is the number of connections between 
the neighbors of vertex L In a recent study, Elo and col- 
leagues recommended using clustering coefficient to 
choose the optimal network parameter [5]. Their experi- 
mental results were based exclusively on threshold- 
based networks. 

Furthermore, a subtle but significant difference is that 
in their method, the random network was generated by 
randomly rewiring the true network. In contrast, in our 
method, the random network was generated by applying 
the same network construction method to a randomly 
permuted data set. As a randomly rewired network has 
no modular structure at all, its clustering coefficient is 
close to zero, when the network is sufficiently sparse. In 
contrast, the clustering coefficient of a network con- 
structed from a random data set is non-negligible. In 
addition, our method searches for the parameter that 
corresponds to a global maximum value of Tf^ue - 
^randomy while their method searches for the parameter 
that corresponds to the first local maximum of Ttme - 
T random- As a rcsult, our method is less prone to noises 
than their method. The second type of topological mea- 
sure we propose is a novel measurement specific for the 
mKNN method. 

Assume that we choose parameter k in the mKNN 
method, and the average vertex degree of the resulting 
network is Uk- We define the normalized degree of the 
network as nj^lL The normalized degree for any mKNN 
network is between 0 and 1. We use the normalized 
degree as the topological measure, and apply Equation 
(1) to choose a k that maximizes the difference between 
the normalized degree of the true network and that of 
its random counterpart. The rationale is as follows. In 
the mKNN network, the normalized degree is related to 
the conditional probability p{sij > Su^ \ Sij > Sjj). Consider 
a similarity matrix where the similarity scores are com- 
pletely random, which means p{sij > Su) and p{sij > Sjj^) 
are independent. When each vertex chooses k neighbors, 
the probability for each of the k neighbors to also rank 
the current vertex as a top-Zc neighbor is exactly k/N, 
where N is the size of the network. The expected degree 



is therefore I<^/N and the expected normalized degree 
would be k^/N/k = kIN. In a non-random similarity 
matrix that has modular structures, when k is small (or 
more precisely, smaller than a typical module size), the 
k nearest neighbors of most vertices are members of 
their modules, and therefore the expected degree for 
each vertex would be l^ln, where n is the size of the 
module that the vertex is in. The average degree of the 
network would be proportional to cl^lN where c is the 
number of modules. Consequently, the normalized 
degree would be proportional to ckIN and the difference 
between the normalized degree of the true network and 
that of the random network would grown as k grows, 
until k is about the same size of a typical module. After 
that, when k increases, new neighbors for most vertices 
would be chosen primarily from outside of their module, 
randomly. The probably p{Sij > Su^ \ Sij > Sjj) now drops 
to k/N from k/n and as a result, the difference between 
the normalized degree of the true and random networks 
would decrease when k increases. 

Module detection and annotation 

Many module detection algorithms have been devel- 
oped, most of which rely on some graph partitioning 
routines. We recently developed two graph partitioning 
algorithms within the framework of community discov- 
ery, which aims to identify the most interesting "natural" 
communities (i.e., relatively dense subnetworks) without 
user-tuned parameters [26]. The first algorithm, called 
Qcut, partitions a network by optimizing a well-known 
modularity function [26]. The second algorithm, called 
HQcut, solves the intrinsic resolution limit problem of 
the modularity function by iteratively calling Qcut to 
identify communities that does not contain any statisti- 
cally significant sub-communities [26]. Here we employ 
the HQcut algorithm to the co-expression networks and 
treat the identified communities as candidates of func- 
tional modules. HQcut does not use any user-tunable 
parameters, except an optional statistical significance 
cutoff. We used a fixed cutoff (z-score = 2). Previously 
we have shown that in general the results of HQcut are 
not sensitive to this cutoff value [26]. 

We use enrichment of Gene Ontology terms to evalu- 
ate the significance of functional modules [54]. Specifi- 
cally, given a gene subnetwork s and a Gene Ontology 
term t, the p-vslue for the enrichment of ^ in 5 is esti- 
mated by the cumulative hypergeometric test: 



min(m,n) 



p(t,s)= ^ 

k=a 



m 



N - m 
n-k 



n 



(2) 
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where N is the number of genes in the genome, m is 
the size of the subnetwork, n is the number of genes in 
the genome with function and a is the number of 
genes in s with function t 

Discovery of cis-regulatory elements and construction of 
cis-regulatory network 

To estabUsh the connection between co-expression and 
CO -regulation in Arabidopsis microarray data, we explore 
the known transcription factor binding sites to find cis- 
regulatory elements (motifs) within each functional 
module. To do this, the promoter region (1000 bp 
upstream from the transcription start site) of each gene 
in a module is scanned with over 500 known motifs 
curated in the PLACE database, represented as consen- 
sus sequences [30]. The idea is that if a motif is found 
to be enriched in the genes' promoters in a module, 
then perhaps those genes are regulated by that motif. 
To account for motif degeneracy, we allow a certain 
number of mismatches during the motif scanning. For 
long consensus sequences, this is necessary because 
many transcription factor binding sites are different 
from their canonical consensus sequences. 

How to determine the number of mismatches to be 
allowed, however, is not trivial. We propose a simple 
strategy to search the optimal number of mismatches (for 
each motif) that can result in the most significant enrich- 
ment of the motif in a particular module. The sequence 
occurrence of the motif (with up to / mismatches) within 
a module is compared to that in the entire genome and 
the enrichment of the motif in the module is computed 
using the cumulative hyper-geometric test similarly as for 
testing the enrichment of Gene Ontology terms using 
Equation (2), where t now means a motif rather than a 
functional term. We vary / between 0 and L, where L is 
proportional to the length of the motif, and choose the 
optimal / that gives the most significant enrichment of 
the motif. 

The final cis-regulatory network is constructed by 
treating functional modules and motifs as vertices, and 
an edge is created between a module and a motif if the 
motif is determined to be significantly enriched in the 
module. Since we are testing thousands of motif candi- 
dates (with different number of mismatches), p-valxxes 
must be corrected for multiple hypothesis testing pro- 
blem. We therefore computed the false discovery rate 
(FDR) using Benjamini's and Hochberg's procedure [55]. 
A motif is considered significantly enriched in a module 
if the FDR is no greater than 10%. 

Gene-centric co-expression and cis-regulatory element 
analysis 

For each GA 2-oxoglutarate-dependent dioxygenase (2- 
ODD) gene or GA receptor (GID) gene, we first get its 



neighboring genes in the co-expression network; we 
then apply motif analysis to find motifs that are not 
only present in the 20DD gene promoter, but also over- 
represented in its neighboring gene promoters, where 
over-representation is determined using Equation (2). 
We used a p-v^hxe cutoff 0.001. Such motifs are putative 
cis-regulatory elements for the 20DD or GID gene. 
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